ON DECONVOLUTION OF DISTRIBUTION FUNCTIONS

The subject of this paper is the problem of nonparametric estimation of a
continuous distribution function from observations with measurement errors.
We study minimax complexity of this problem when the unknown distribution has
a density belonging to the Sobolev class, and the error density is ordinary
smooth. We develop rate optimal estimators based on direct inversion of the
empirical characteristic function. We also derive minimax affine estimators
of the distribution function which are given by an explicit convex
optimization problem. Adaptive versions of these estimators are proposed, and
some numerical results demonstrating good practical behavior of the developed
procedures are presented.

1. Introduction. In this paper we study the problem of estimating a 
distribution function in the presence of measurement errors. 

Let $X_1,\dots,X_n$ be a sequence of independent, identically distributed
random variables with common distribution $F$. Suppose that we observe random
variables $Y_1,\dots,Y_n$ given by

(1) $Y_j = X_j + \zeta_j$, $j = 1,\dots,n$,

where $\zeta_j$ are i.i.d. random variables, independent of the $X_j$'s, with
density $f_\zeta$ w.r.t. the Lebesgue measure on the real line. The objective
is to estimate the value $F(t_0)$ of the distribution function $F$ of $X$ at a
given point $t_0 \in \mathbb{R}$ from the observations $Y^n = (Y_1,\dots,Y_n)$.

By an estimator we mean any measurable function $\tilde F = \tilde F(Y^n)$ of
the observations $Y^n$. We adopt the minimax approach for measuring estimation
accuracy. Let $\mathcal{F}$ be a given family of probability distributions on
$\mathbb{R}$. Given an estimator $\tilde F$ of $F(t_0)$, we consider two types
of maximal over $\mathcal{F}$ risks:



Received June 2010; revised May 2011. 

¹Supported by BSF Grant 2006075.

AMS 2000 subject classifications. 62G05, 62G20. 

Key words and phrases. Adaptive estimator, deconvolution, minimax risk, rates of con- 
vergence, distribution function. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2011, Vol. 39, No. 5, 2477-2501. This reprint differs from the original in 
pagination and typographic detail. 






I. DATTNER, A. GOLDENSHLUGER AND A. JUDITSKY 



• quadratic risk,

$\mathrm{Risk}_2[\tilde F; \mathcal{F}] := \sup_{F \in \mathcal{F}} \{E|\tilde F - F(t_0)|^2\}^{1/2}$;



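• $\epsilon$-risk: given a tolerance level $\epsilon \in (0, 1/2)$ we define

$\mathrm{Risk}_\epsilon[\tilde F; \mathcal{F}] := \inf\{\delta : \sup_{F \in \mathcal{F}} P\{|\tilde F - F(t_0)| \ge \delta\} \le \epsilon\}$.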


An estimator $\tilde F_*$ is said to be rate optimal or optimal in order with
respect to Risk if

$\mathrm{Risk}[\tilde F_*; \mathcal{F}] \le C \inf_{\tilde F} \mathrm{Risk}[\tilde F; \mathcal{F}]$,

where inf is taken over all possible estimators of $F(t_0)$, and $C < \infty$
is independent of $n$. We will be particularly interested in the classes of
distributions having density with respect to the Lebesgue measure on the real
line.

The outlined problem is closely related to the density deconvolution problem
that has been extensively studied in the literature; see, for example,
[4, 5, 13, 18, 24, 27, 28] and references therein. In these works the minimax
rates of convergence have been derived under different assumptions on the
error density and on the smoothness of the density to be estimated. Depending
on the tail behavior of the characteristic function $\hat f_\zeta$ of $\zeta$,
the following two cases are usually distinguished:

(i) ordinary smooth errors, when the tails of $\hat f_\zeta$ are polynomial,
that is,

$|\hat f_\zeta(\omega)| \asymp |\omega|^{-\beta}$, $|\omega| \to \infty$,

for some $\beta > 0$;

(ii) supersmooth errors, when the tails of $\hat f_\zeta$ are exponential,
that is,

$|\hat f_\zeta(\omega)| \asymp \exp\{-c|\omega|^{\beta}\}$, $|\omega| \to \infty$,

for some $c > 0$ and $\beta > 0$.

The aforecited papers derive minimax rates of convergence for different
functional classes under ordinary smooth and supersmooth errors.

In contrast to the voluminous literature on density deconvolution, the
problem of deconvolution of the distribution function $F$ has attracted much
less attention and has been studied in very few papers (see [24],
Section 2.7.2, for a recent review of corresponding contributions). A
consistent estimator of a distribution function from observations with
additive Gaussian measurement errors was developed by [14]. A "plug-in"
estimator based on integration of the density estimator in the density
deconvolution problem has been studied under moment conditions on $F$ in [28].
The paper [13] also considered the estimator based on integration of the
density deconvolution estimator. It was shown there that under a tail
condition on $F$ the estimator achieves optimal rates of convergence provided
that the errors are supersmooth. For the case of ordinary smooth errors there
is a gap between the upper and lower bounds reported in [13], which leaves
open the question of constructing optimal estimators. More recently, some
minimax rates of estimation of distribution functions in models with
measurement errors were reported in [17]. Note also that [3] considered a
general problem of optimal and adaptive estimation of linear functionals
$\ell(f) = \int_{-\infty}^{\infty} \phi(t) f(t)\,dt$ in the model (1). However,
their results hold only for a representative $\phi \in L_1(\mathbb{R})$, which
is clearly not the case in the problem of recovery of the distribution
function.

The objective of this paper is to develop optimal methods of minimax
deconvolution of distribution functions and to answer several questions raised
by known results on this problem: Is a smoothness assumption alone on $F$
sufficient in order to secure minimax rates of estimation of the sort
$O(n^{-\gamma})$ for $\gamma > 0$ in the case of ordinary smooth errors? Do we
need tail or moment conditions on $F$?

Our contribution is two-fold. First, we characterize the minimax rates of
convergence in the case when the unknown distribution belongs to a Sobolev
ball, and the observation errors are ordinary smooth. The rates of convergence
depend crucially on the relation between the smoothness index $\alpha$ of the
Sobolev ball and the parameter $\beta$ [the rate at which the characteristic
function of errors tends to zero; see (i) above]. In contrast to the density
deconvolution problem, it turns out that there are different regions in the
$(\alpha,\beta)$-plane where different rates of convergence are attained. We
show that in some regions of the $(\alpha,\beta)$-plane the minimax rates of
convergence are attained by a linear estimator, which is based on direct
inversion of the distribution function from the corresponding characteristic
function; cf. [17]. It is worth noting that we do not require any additional
tail or moment conditions on the unknown distribution. In the case when the
parameters of the regularity class of the distribution $F$ are unknown, we
also construct an adaptive estimator based on Lepski's adaptation scheme [23].
The $\epsilon$-risk of this estimator is within a $\ln\ln n$-factor of the
minimax $\epsilon$-risk.

Second, using recent results on estimating linear functionals developed in 
[19], we propose minimax and adaptive affine estimators of the cumulative 
distribution function for a discrete distribution deconvolution problem; see 
also [6, 9-12] for the general theory of affine estimation. These estimators 
can be applied to the original deconvolution problem provided that it can 
be efficiently discretized. By efficient discretization we mean that:

(1) the support of the distributions of $X$ ($Y$) can be "compactified" [one
can point out a compact subset of $\mathbb{R}$ such that the probability of
$X$ ($Y$) being outside this set is "small"] and binned into small intervals;

(2) the class $\mathcal{X}$ of discrete distributions, obtained by the
corresponding finite-dimensional cross-section of the class of continuous
distributions, is a computationally tractable convex closed set.²

Under these conditions one can efficiently implement the minimax affine
estimator for $F$ based on the approach proposed in [19]. This estimator is
rate minimax with respect to $\mathrm{Risk}_\epsilon$ (within a factor
$\approx 2$ for small $\epsilon$) whatever the noise distribution and the
convex closed class $\mathcal{X}$ are.

We describe construction of the minimax affine estimator of F when the 
class X is known and provide an adaptive version of the estimation procedure 
when the available information allows us to construct an embedded family 
of classes. 

The rest of the paper is structured as follows. We present our results on
estimation over the Sobolev classes in Section 2. Section 3 deals with minimax
and adaptive affine estimation. Section 4 presents a numerical study of the
proposed adaptive estimators and discusses their relative merits. Proofs of
all results are given in the supplementary article [8].

2. Estimation over Sobolev classes. 

2.1. Notation. We denote by $f_Y$ and $f_\zeta$ the densities of random
variables $Y$ and $\zeta$; with certain abuse of notation we simply denote by
$f$ the density of the unknown distribution of $X$.

Let $g$ be a function on $\mathbb{R}$; we denote by $\hat g$ the Fourier
transform of $g$,

$\hat g(\omega) = \int_{-\infty}^{\infty} g(x) e^{i\omega x}\,dx$, $\omega \in \mathbb{R}$.

We consider the classes of absolutely continuous distributions.

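Definition 2.1. Let $\alpha > -1/2$, $L > 0$. We say that $F$ belongs to the
class $\mathcal{F}_\alpha(L)$ if it has a density $f$ with respect to the
Lebesgue measure on $\mathbb{R}$, and

$\frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat f(\omega)|^2 (1+\omega^2)^{\alpha}\,d\omega \le L^2$.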



²Roughly speaking, a computationally tractable set can be interpreted as a set
given by a finite system of inequalities $p_i(x) \le 0$, $i = 1,\dots,m$, where
$p_i$ are convex polynomials; see, for example, [2], Chapter 4.
The set $\mathcal{F}_\alpha(L)$ with $\alpha > -1/2$ contains absolutely
continuous distributions. If $\alpha > 1/2$, then the distributions $F$ from
$\mathcal{F}_\alpha(L)$ have bounded continuous densities. Usually
$\mathcal{F}_\alpha(L)$ is referred to as the Sobolev class.

We use extensively the following inversion formula: for a continuous
distribution $F$ one has

(2) $F(x) = \frac{1}{2} - \frac{1}{\pi} \int_0^{\infty} \frac{1}{\omega}\,\Im\{e^{-i\omega x} \hat f(\omega)\}\,d\omega$, $x \in \mathbb{R}$,

where $\Im$ stands for the imaginary part, and the above integral is
interpreted as an improper Riemann integral
$\lim_{T\to\infty} \int_{1/T}^{T} \omega^{-1}\,\Im\{e^{-i\omega x} \hat f(\omega)\}\,d\omega$.
For the proof of (2) see [15, 16] and [20], Section 4.3.
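
The inversion formula (2) can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes a standard normal $X$ (characteristic function $e^{-\omega^2/2}$, chosen only because its distribution function is available in closed form), truncates the integral at a finite upper limit and uses a plain midpoint rule. All function names are hypothetical.

```python
import math

def cdf_by_inversion(x, char_fn, upper=40.0, steps=20000):
    """Approximate F(x) = 1/2 - (1/pi) int_0^upper Im{e^{-iwx} f_hat(w)} / w dw
    by a midpoint rule; the integrand stays bounded as w -> 0."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        w = (k + 0.5) * h
        z = complex(math.cos(w * x), -math.sin(w * x)) * char_fn(w)
        total += z.imag / w
    return 0.5 - total * h / math.pi

def normal_char_fn(w):
    # Characteristic function of N(0, 1); an assumption made for this check only.
    return complex(math.exp(-0.5 * w * w), 0.0)

def normal_cdf(x):
    # Exact standard normal distribution function, used as the reference value.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For instance, `cdf_by_inversion(0.7, normal_char_fn)` can be compared with `normal_cdf(0.7)`; the two should agree up to the quadrature error.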

Throughout this section we assume that the error characteristic function
does not vanish:

$\hat f_\zeta(\omega) \ne 0$ for all $\omega \in \mathbb{R}$.

This is a standard assumption in deconvolution problems.

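2.2. Minimax rates of estimation. In model (1) we have
$\hat f(\omega) = \hat f_Y(\omega)/\hat f_\zeta(\omega)$, and $\hat f_Y(\omega)$
can be easily estimated by the empirical characteristic function of the
observations $Y$. This motivates the following construction: for $\lambda > 0$
we define the estimator $\hat F_\lambda$ of $F(t_0)$ by

(3) $\hat F_\lambda = \frac{1}{2} - \frac{1}{n} \sum_{j=1}^{n} \frac{1}{\pi} \int_0^{\lambda} \frac{1}{\omega}\,\Im\Big\{\frac{e^{i\omega(Y_j - t_0)}}{\hat f_\zeta(\omega)}\Big\}\,d\omega$.

Here $\lambda$ is the design parameter to be specified. Note that if the
density $f_\zeta$ is symmetric around the origin, then $\hat f_\zeta$ is real,
and the estimator $\hat F_\lambda$ takes the form (cf. [17])

$\hat F_\lambda = \frac{1}{2} - \frac{1}{n} \sum_{j=1}^{n} \frac{1}{\pi} \int_0^{\lambda} \frac{\sin\{\omega(Y_j - t_0)\}}{\hat f_\zeta(\omega)\,\omega}\,d\omega$.

Note that $\hat F_\lambda$ may be truncated to the interval [0, 1]; obviously,
the risk of such a "projected" estimator does not exceed that of
$\hat F_\lambda$.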
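
To make the construction concrete, here is a minimal sketch of estimator (3) in the symmetric case. It is an illustration under stated assumptions rather than the authors' implementation: the error law is taken to be a centered Laplace distribution with a given scale (so that $\hat f_\zeta(\omega) = 1/(1 + a^2\omega^2)$ is real), the integral over $(0, \lambda)$ is approximated by a midpoint rule, and the function names are hypothetical.

```python
import math

def laplace_char_fn(w, scale):
    # Characteristic function of a centered Laplace error (assumed error law);
    # it is real-valued because the density is symmetric around the origin.
    return 1.0 / (1.0 + (scale * w) ** 2)

def deconvolution_cdf_estimate(sample, t0, lam, char_fn, steps=2000):
    """Estimator (3) for a symmetric error density:
    1/2 - (1/n) sum_j (1/pi) int_0^lam sin(w (Y_j - t0)) / (f_hat(w) w) dw."""
    n = len(sample)
    h = lam / steps
    total = 0.0
    for k in range(steps):
        w = (k + 0.5) * h
        mean_sin = sum(math.sin(w * (y - t0)) for y in sample) / n
        total += mean_sin / (w * char_fn(w))
    return 0.5 - total * h / math.pi

def project_to_unit_interval(value):
    # Truncation of the estimate to [0, 1], as mentioned in the text.
    return min(1.0, max(0.0, value))
```

A sample placed symmetrically around $t_0$ yields exactly 1/2, and observations far below $t_0$ push the estimate toward 1, which gives two quick sanity checks of the sign conventions.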

Our current goal is to establish an upper bound on the risk of the estimator
$\hat F_\lambda$ over the classes $\mathcal{F}_\alpha(L)$. We need the
following assumptions on the distribution of the measurement errors $\zeta_i$:

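(E1) There exist real numbers $\beta > 0$, $c_\zeta > 0$ and $C_\zeta > 0$ such that
$c_\zeta (1+|\omega|)^{-\beta} \le |\hat f_\zeta(\omega)| \le C_\zeta (1+|\omega|)^{-\beta}$ $\forall \omega \in \mathbb{R}$.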

(E2) There exist positive real numbers $\omega_0$, $b_\zeta$ and $\tau$ such that
$|\hat f_\zeta(\omega)| \ge 1 - b_\zeta |\omega|^{\tau}$ $\forall |\omega| \le \omega_0$.

Fig. 1. Division of the parameter set $\Theta$ for $(\alpha, \beta)$.

Assumption (E1) characterizes the case of the ordinary smooth errors.
Assumption (E2) describes the local behavior of $\hat f_\zeta$ near the
origin. It is well known that for any distribution of a nondegenerate random
variable there exist positive constants $b$ and $\delta$ such that
$|\hat f_\zeta(\omega)| \le 1 - b\omega^2$ for all $|\omega| \le \delta$ (see,
e.g., [25], Lemma 1.5). Thus in (E2) we have $\tau \in (0, 2]$. Typical
examples of distributions satisfying (E1) and (E2) are the Laplace and Gamma
distributions. For example, for the Laplace distribution (E1) holds with
$\beta = 2$, and (E2) holds with $\tau = 2$. The Gamma distribution provides an
example of a distribution satisfying (E1) with $\beta > 0$ being the shape
parameter of the distribution.

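As we will see in the sequel, the rates of convergence of the risks
$\mathrm{Risk}_2[\hat F_\lambda; \mathcal{F}_\alpha(L)]$ and
$\mathrm{Risk}_\epsilon[\hat F_\lambda; \mathcal{F}_\alpha(L)]$ are mainly
determined by the relationship between parameters $\alpha$ and $\beta$.
Consider the following two subsets of the parameter set
$\Theta := \{(\alpha,\beta) : \alpha > -1/2, \beta > 0\}$ for the pair
$(\alpha,\beta)$:

$\Theta_r := \{(\alpha,\beta) \in \Theta : \alpha + \beta > 1/2\}$, $\Theta_s := \{(\alpha,\beta) \in \Theta : \alpha + \beta < 1/2\}$.

If $(\alpha,\beta) \in \Theta_s$, then necessarily
$\hat f_\zeta \notin L_1(\mathbb{R})$; in addition, because $\alpha < 1/2$, the
density $f$ can be discontinuous. That is why we will refer to $\Theta_s$ as
the singular zone, while the subset $\Theta_r$ will be called the regular
zone. We denote by $\Theta_b$ the border zone between $\Theta_r$ and
$\Theta_s$:

$\Theta_b := \{(\alpha,\beta) \in \Theta : \alpha + \beta = 1/2\}$.

Division of the parameter set $\Theta$ into zones $\Theta_r$, $\Theta_s$ and
$\Theta_b$ is displayed in Figure 1. The figure also shows the sub-regions
$\Theta_{r,i}$ and $\Theta_{s,i}$, $i = 1, 2$, that are defined by the
following formulas:

$\Theta_{r,1} := \{(\alpha,\beta) \in \Theta_r : \beta > 1/2\}$, $\Theta_{r,2} := \{(\alpha,\beta) \in \Theta_r : \beta < 1/2\}$,

$\Theta_{s,1} := \{(\alpha,\beta) \in \Theta_s : \alpha + 3\beta > 1/2\}$, $\Theta_{s,2} := \{(\alpha,\beta) \in \Theta_s : \alpha + 3\beta < 1/2\}$.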
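
The division into zones is a pair of linear inequalities in $(\alpha, \beta)$, so it is easy to tabulate. The helper below is a hypothetical illustration (the function name and the returned labels are not from the paper); it uses exact floating-point comparisons, so boundary cases are only detected for exactly representable inputs.

```python
def classify_zone(alpha, beta):
    """Label the zone of (alpha, beta) following the definitions of the
    regular, singular and border zones and their sub-regions."""
    if alpha <= -0.5 or beta <= 0.0:
        raise ValueError("the parameter set requires alpha > -1/2 and beta > 0")
    total = alpha + beta
    if total > 0.5:
        if beta > 0.5:
            return "regular zone, sub-region r,1"
        if beta < 0.5:
            return "regular zone, sub-region r,2"
        return "regular zone, boundary beta = 1/2"
    if total < 0.5:
        if alpha + 3.0 * beta > 0.5:
            return "singular zone, sub-region s,1"
        if alpha + 3.0 * beta < 0.5:
            return "singular zone, sub-region s,2"
        return "singular zone, boundary alpha + 3 beta = 1/2"
    return "border zone"
```

For the Laplace errors mentioned above ($\beta = 2$), any admissible $\alpha$ gives $\alpha + \beta > 1/2$, so the pair always falls in the regular zone.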
The next two theorems present bounds on the risks in the regular zone:
Theorem 2.1 states upper bounds on the risks of $\hat F_\lambda$, while
Theorem 2.2 contains the corresponding lower bounds on the minimax risks.

For $z > 1$ define

$\lambda(z) := z^{1/[2\alpha + (2\beta \vee 1)]}$, $\psi(z) := \begin{cases} z^{-(2\alpha+1)/(4\alpha+4\beta)}, & \beta > 1/2, \\ \sqrt{\ln z / z}, & \beta = 1/2, \\ 1/\sqrt{z}, & \beta \in (0, 1/2). \end{cases}$

Theorem 2.1. Let assumptions (E1) and (E2) hold, and suppose that
$(\alpha,\beta) \in \Theta_r$. If $\hat F_{\lambda_*}$ is estimator (3)
associated with $\lambda_* = C_1(\alpha, L)\lambda(n)$, then for all
$t_0 \in \mathbb{R}$ and large enough $n$,

$\mathrm{Risk}_2[\hat F_{\lambda_*}; \mathcal{F}_\alpha(L)] \le \psi_n(\alpha, L) := C_2(\alpha, L)\psi(n)$.

In addition, if $\lambda_* = C_1(\alpha, L)\lambda(n/\ln[2\epsilon^{-1}])$, then
for all $t_0 \in \mathbb{R}$ and large enough $n$,

$\mathrm{Risk}_\epsilon[\hat F_{\lambda_*}; \mathcal{F}_\alpha(L)] \le \psi_{n,\epsilon}(\alpha, L) := C_3(\alpha, L)\psi(n/\ln[2\epsilon^{-1}])$,

provided that $\epsilon > 2\exp\{-C_4(\alpha, L)n\}$. The constants $C_i$,
$i = 1,\dots,4$, are specified in the proof of the theorem (see (A.15)-(A.22)
in [8]).
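
For orientation, the orders of the bandwidth and of the risk in Theorem 2.1 can be evaluated directly. The sketch below assumes the definitions $\lambda(z) = z^{1/[2\alpha + (2\beta \vee 1)]}$ and $\psi(z)$ as read above, ignores the constants $C_i(\alpha, L)$ (which are specified only in the proofs), and uses hypothetical function names.

```python
import math

def bandwidth_order(z, alpha, beta):
    # lambda(z) = z^{1 / [2 alpha + max(2 beta, 1)]}
    return z ** (1.0 / (2.0 * alpha + max(2.0 * beta, 1.0)))

def risk_rate(z, alpha, beta):
    # psi(z): nonparametric rate for beta > 1/2, parametric rate for beta < 1/2,
    # and a logarithmic correction on the boundary beta = 1/2.
    if beta > 0.5:
        return z ** (-(2.0 * alpha + 1.0) / (4.0 * alpha + 4.0 * beta))
    if beta == 0.5:
        return math.sqrt(math.log(z) / z)
    return 1.0 / math.sqrt(z)
```

With $n = 10^4$, $\alpha = 1$ and $\beta = 1$ this gives a bandwidth of order $n^{1/4} = 10$ and a rate of order $n^{-3/8}$, compared with the parametric order $n^{-1/2} = 0.01$ obtained when $\beta < 1/2$.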

Theorem 2.1 shows that if $(\alpha,\beta)$ is in the regular zone $\Theta_r$
and $\beta \in (0, 1/2)$, then the estimator $\hat F_{\lambda_*}$ attains the
parametric rate of convergence. In the case $\beta = 1/2$ this rate is within
a logarithmic factor of the parametric rate. The natural question is whether
the estimator $\hat F_{\lambda_*}$ is rate optimal whenever $\beta > 1/2$ and
$(\alpha,\beta) \in \Theta_r$. The answer is provided by Theorem 2.2.

We need the following assumption. 

(E3) The characteristic function $\hat f_\zeta$ is twice differentiable, and
there exist real numbers $\beta > 1/2$, $C_\zeta > 0$ and $\omega_* > 0$ such
that

$(1+|\omega|)^{\beta} \max_{j=0,1,2} \{|\hat f_\zeta^{(j)}(\omega)|\} \le C_\zeta$ $\forall |\omega| \ge \omega_*$.

Assumption (E3) is rather standard in derivations of lower bounds for
deconvolution problems. This assumption should be compared to condition (G3)
in [13]; it is assumed there that for $j = 0, 1, 2$ the product
$|\hat f_\zeta^{(j)}(\omega)|\,|\omega|^{\beta+j}$ remains bounded as
$|\omega| \to \infty$. Note that (E3) is a weaker assumption.

Theorem 2.2. Let assumption (E3) hold. Suppose that the class
$\mathcal{F}_\alpha(L)$ is such that
$L^2 > \pi^{-1} 2^{1+(\alpha-1)_+} \Gamma(2\alpha+1)$ and $\alpha > 1/2$. Then
there exist
Table 1

The bandwidth order $\lambda(n)$ and the convergence rate of the maximal risk
$\psi(n)$ in the singular and border zones

Border zone $\Theta_b$: $\alpha + \beta = 1/2$          Singular zone $\Theta_s$: $\alpha + \beta < 1/2$

$\beta > 1/2$   $\beta = 1/2$   $\beta < 1/2$          $\alpha + 3\beta > 1/2$   $\alpha + 3\beta < 1/2$

A(n) ^ ( n ^l/(2.+l) ^2/(2a+3-2« 

^(n) (Vi^\a + l/2 {Innf/^ (hm)^/'^ {2<:<+ 1 ) / (2q + 3- 2/3 



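constants $c_1$ and $c_2$ depending on $\alpha$, $\beta$ and the error
distribution only such that, for all $n$ large enough,

$\inf_{\tilde F} \mathrm{Risk}_2[\tilde F; \mathcal{F}_\alpha(L)] \ge c_1 L^{(2\beta-1)/(2\alpha+2\beta)} \phi_n$,

$\inf_{\tilde F} \mathrm{Risk}_\epsilon[\tilde F; \mathcal{F}_\alpha(L)] \ge c_2 L^{(2\beta-1)/(2\alpha+2\beta)} \phi_{n,\epsilon}$,

where $\phi_n := \phi(n)$, $\phi_{n,\epsilon} := \phi(n/\ln \epsilon^{-1})$,
$\phi(z) := z^{-(2\alpha+1)/(4\alpha+4\beta)}$, and inf is taken over all
possible estimators of $F(t_0)$.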

The results of Theorems 2.1 and 2.2 deal with the regular zone. While we do
not present the lower bound for the case of $\alpha < 1/2$, we expect that the
bounds of Theorem 2.2 hold for the whole regular zone.

It is important to realize that the risks of $\hat F_\lambda$ converge to zero
for all $(\alpha,\beta) \in \Theta$, and, in particular, for
$(\alpha,\beta) \in \Theta_s$ and $(\alpha,\beta) \in \Theta_b$. The next
statement establishes upper bounds on
$\mathrm{Risk}_2[\hat F_\lambda; \mathcal{F}_\alpha(L)]$ in the singular and
border zones, $\Theta_s$ and $\Theta_b$.

Theorem 2.3. Let assumptions (E1) and (E2) hold. If $\hat F_{\lambda_*}$ is
the estimator (3) associated with $\lambda_* = C_1(\alpha, L)\lambda(n)$, then
for all $t_0 \in \mathbb{R}$ and large enough $n$

$\mathrm{Risk}_2[\hat F_{\lambda_*}; \mathcal{F}_\alpha(L)] \le C_2(\alpha, L)\varphi(n)$,

where the sequences $\lambda(n)$ and $\varphi(n)$ are given in Table 1, and
constants $C_1$ and $C_2$ are specified in the proof (see (A.15)-(A.22) in
[8]). In addition, if $\lambda_* = C_3(\alpha, L)\lambda(n/\ln[2\epsilon^{-1}])$,
then for large enough $n$

$\mathrm{Risk}_\epsilon[\hat F_{\lambda_*}; \mathcal{F}_\alpha(L)] \le C_4(\alpha, L)\varphi(n/\ln[2\epsilon^{-1}])$.

Several remarks on the results of Theorems 2.1-2.3 are in order.
Remarks. (1) Theorem 2.1 shows that the regular zone $\Theta_r$ is decomposed
into three disjoint regions with respect to the upper bounds on the risks of
$\hat F_{\lambda_*}$. In the zone $\Theta_{r,2}$ where $\beta < 1/2$, the rates
of convergence are parametric; because of roughness of the error density, here
the estimation problem is essentially a parametric one. The region
$\Theta_{r,1}$ is characterized by nonparametric rates, while in the border
zone between $\Theta_{r,1}$ and $\Theta_{r,2}$ ($\beta = 1/2$) the rate of
convergence differs from the parametric one by a factor logarithmic in $n$.

(2) The condition on $L$ stated in Theorem 2.2 is purely technical; it
requires that the family $\mathcal{F}_\alpha(L)$ is rich enough. It follows
from Theorems 2.1 and 2.2 that the estimator $\hat F_{\lambda_*}$ is optimal in
order in the regular zone if $\alpha > 1/2$.

(3) The subdivision of the singular zone $\Theta_s$ into two zones
$\Theta_{s,1} = \{(\alpha,\beta) \in \Theta_s : 3\beta + \alpha > 1/2\}$ and
$\Theta_{s,2} = \{(\alpha,\beta) \in \Theta_s : 3\beta + \alpha < 1/2\}$ is a
consequence of two types of upper bounds that we have on the variance term;
see (14) in [8]. In the border zone $\Theta_b$ the upper bounds on the risk
differ from those in the regular zone only by factors logarithmic in $n$. We
do not know if the estimator $\hat F_{\lambda_*}$ is rate optimal in the
singular and border zones.

(4) Note that the results of Theorems 2.1 and 2.3, when put together, allow
us to establish risk bounds for any pair $(\alpha,\beta)$ from the parameter
set $\Theta = \{(\alpha,\beta) : \alpha > -1/2, \beta > 0\}$. In particular,
for any fixed $\alpha > -1/2$, the rate of convergence of the maximal risk
approaches the parametric rate when $\beta$ approaches zero. We would like to
stress the fact that no tail or moment conditions on $F$ are required to
obtain these results; such conditions were systematically imposed in the
previous work on deconvolution of distribution functions.

2.3. Adaptive estimation. The choice of the smoothing parameter $\lambda$ in
(3) is crucial in order to achieve the optimal estimation accuracy. As
Theorems 2.1 and 2.2 show, if parameters $\alpha$ and $L$ of the class
$\mathcal{F}_\alpha(L)$ are known, then one can choose $\lambda$ in such a way
that the resulting estimator is optimal in order. In practice the functional
class $\mathcal{F}_\alpha(L)$ is hardly known; in these situations the
estimator of Section 2 cannot be implemented. Note, however, that this does
not pose a serious problem in the regular zone when $\beta \in (0, 1/2)$.
Indeed, here if we choose $\lambda = \sqrt{n}$, then the resulting estimator
will be optimal in order for any functional class $\mathcal{F}_\alpha(L)$
satisfying $\lambda_* = \lambda_*(\alpha, L) \le \sqrt{n}$, where $\lambda_*$
is defined in Theorem 2.1.

The situation is completely different in the case $\beta > 1/2$. In this
section we develop an estimator that is nearly optimal for the
$\epsilon$-risk over a scale of classes $\mathcal{F}_\alpha(L)$. The
construction of our adaptive estimator is based on the general scheme by [23].
2.3.1. Estimator construction. Consider the family of estimators
$\{\hat F_\lambda, \lambda \in \Lambda\}$, where $\hat F_\lambda$ is defined in
(3), $\Lambda := \{\lambda_j, j = 1,\dots,N\}$ with
$\lambda_{\min} := \lambda_1$, $\lambda_{\max} := \lambda_N$, and
$\lambda_j = 2^{j-1}\lambda_{\min}$, $j = 2,\dots,N$. The adaptive estimator
$\hat F$ is obtained by selection from the family
$\{\hat F_\lambda, \lambda \in \Lambda\}$ according to the following rule.

Let 

(4) wi:=min{^^o,(4fec)^'/"}, := 27r-^[2 + (I/t)]^ , 

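where constants $\omega_0$, $b_\zeta$ and $\tau$ appear in assumption (E2).
For any $\lambda \in \Lambda$ we define

(5) $\hat\sigma_\lambda^2 := \dots$, $\hat\Sigma_\lambda := \max_{\mu \in \Lambda,\ \mu \le \lambda} \hat\sigma_\mu$.

Note that $\hat\Sigma_\lambda$ can be computed from the data (the parameters
$\tau$ and $\omega_1$ are determined completely by $f_\zeta$; hence they are
known). In fact, $\hat\sigma_\lambda^2 n^{-1}$ is a plug-in estimator of an
upper bound on the variance of $\hat F_\lambda$, while $\hat\Sigma_\lambda$ is
a "monotonization" of $\hat\sigma_\lambda$ with respect to $\lambda$.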
Define

$\hat v_\lambda^2 := \hat\Sigma_\lambda^2 + 11 m^2 \lambda^{2\beta} n^{-1} \ln(4N\epsilon^{-1})$, $\lambda \in \Lambda$,



where 

(6) m := V2^+ (^c^/3)-i2^+('5/2-i)+ [2 + /31n+(lM)], 

and the constant $c_\zeta$ appears in assumption (E1).

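Let $\vartheta := \sqrt{2}(\sqrt{2} - 1)^{-1}[1 + \sqrt{\ln(4N\epsilon^{-1})}]$;
then with every estimator $\hat F_\lambda$, $\lambda \in \Lambda$, we
associate the interval

(7) $Q_\lambda := [\hat F_\lambda - \vartheta \hat v_\lambda n^{-1/2},\ \hat F_\lambda + \vartheta \hat v_\lambda n^{-1/2}]$.

Define

(8) $\hat\lambda := \min\{\lambda \in \Lambda : \bigcap_{\mu \in \Lambda,\ \mu \ge \lambda} Q_\mu \ne \emptyset\}$,

and set finally

(9) $\hat F := \hat F_{\hat\lambda}$.

Note that $\hat\lambda$ is well defined: the intersection in (8) is nonempty
for $\lambda = \lambda_{\max}$.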
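
The selection rule can be sketched in a few lines once the estimates and the interval half-widths are available. The function below is a hypothetical illustration, not the authors' code: it takes the values of the estimators on the grid $\lambda_1 < \dots < \lambda_N$ together with the half-widths of the associated intervals (the quantities $\vartheta \hat v_\lambda n^{-1/2}$, assumed to be computed elsewhere) and returns the index of the smallest $\lambda$ for which all intervals with larger $\lambda$ still intersect.

```python
import math

def lepski_select(estimates, half_widths):
    """Return the smallest index j such that the intervals
    [estimates[i] - half_widths[i], estimates[i] + half_widths[i]], i >= j,
    have a common point. Index N-1 always qualifies."""
    lower, upper = -math.inf, math.inf
    best = len(estimates) - 1
    for j in range(len(estimates) - 1, -1, -1):
        lower = max(lower, estimates[j] - half_widths[j])
        upper = min(upper, estimates[j] + half_widths[j])
        if lower > upper:
            break  # the intersection is empty for this j and all smaller j
        best = j
    return best
```

The scan runs from the largest bandwidth downward because the running intersection can only shrink; once it becomes empty it stays empty, so the loop may stop there.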






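2.3.2. Oracle inequality. We will show that the estimator $\hat F$ mimics the
oracle estimator $\hat F_0$ which is defined as follows.

Let

$\sigma_\lambda^2 := \dots$, $\Sigma_\lambda := \max_{\mu \in \Lambda,\ \mu \le \lambda} \sigma_\mu$, $\lambda \in \Lambda$.

It is shown in the proof of Lemma 5.2 (see Section A.1.2 in [8]) that
$\sigma_\lambda^2 n^{-1}$ is an upper bound on the variance of the estimator
$\hat F_\lambda$ associated with parameter $\lambda$. Note that
$\hat\sigma_\lambda$ defined in (5) is the empirical counterpart of the
quantity $\sigma_\lambda$. Define

$v_\lambda^2 := \Sigma_\lambda^2 + 11 m^2 \lambda^{2\beta} n^{-1} \ln(4N\epsilon^{-1})$.

Given $\alpha > 0$ and $L > 0$ let

$\lambda_0 = \lambda_0(\alpha, L) := \min\{\lambda \in \Lambda : v_\lambda n^{-1/2} \ge 2\sqrt{2}\,\pi^{-1/2} L \lambda^{-\alpha-1/2}\}$

and define $\hat F_0 := \hat F_{\lambda_0}$.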

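The oracle estimator $\hat F_0$ has attractive minimax properties over classes
$\mathcal{F}_\alpha(L)$. In particular, it is easily verified that for any
class $\mathcal{F}_\alpha(L)$ such that
$\lambda_0 \le [11 m^2 \ln(4N\epsilon^{-1})]^{-1} n$ one has

$\mathrm{Risk}_\epsilon[\hat F_0; \mathcal{F}_\alpha(L)] \le 2 v_{\lambda_0} n^{-1/2} \le \varkappa_1 \psi_{n,\epsilon}(\alpha, L) + \varkappa_2 \phi_{n,\epsilon}$.

Here $\psi_{n,\epsilon}$ is the upper bound of Theorem 2.1 on the risk of the
estimator $\hat F_{\lambda_*}$ that "knows" $\alpha$ and $L$,
$\phi_{n,\epsilon}$ is defined in Theorem 2.2, and $\varkappa_1$ and
$\varkappa_2$ are constants independent of $\alpha$ and $L$. Thus, the risk of
the oracle estimator admits the same upper bound as the risk of the estimator
$\hat F_{\lambda_*}$ that is based on the knowledge of the class parameters
$\alpha$ and $L$.

Now we are in a position to state a bound on the risk of the estimator
$\hat F$.

Theorem 2.4. Suppose that assumptions (E1), (E2) hold, $\beta > 1/2$ and let

$\lambda_{\max} = [11 m^2 \ln(4N\epsilon^{-1})]^{-1} n$.

If $\hat F$ is the estimator defined in (7)-(9) then for any class
$\mathcal{F}_\alpha(L)$ with $\alpha > 0$ such that
$\lambda_{\min} \le \lambda_0(\alpha, L) \le \lambda_{\max}$, one has

$\mathrm{Risk}_\epsilon[\hat F; \mathcal{F}_\alpha(L)] \le (3 - 1/\sqrt{2})\,\vartheta\, v_{\lambda_0} n^{-1/2}$.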

Estimator (7)-(9) attains the optimal rates of convergence with respect to
$\epsilon$-risk within a $\ln(N\epsilon^{-1})$-factor over the collection of
functional classes $\mathcal{F}_\alpha(L)$. In particular, if $\lambda_{\min}$
is chosen to be a constant, and $\lambda_{\max} \asymp n^{l}$ for some
$l > 1$, then $N = \ln(\lambda_{\max}/\lambda_{\min})/\ln 2 \asymp \ln n$, and
the $\epsilon$-risk of the adaptive estimator is within a $\ln\ln n$-factor of
the minimax $\epsilon$-risk for a scale of Sobolev classes. It can be shown
that this $\ln\ln n$-factor is an unavoidable price for adaptation when the
accuracy is measured by the $\epsilon$-risk; see, for example, [26].

3. Minimax and adaptive affine estimation in discrete deconvolution model. 

The results of Section 2 imply that in the regular zone the minimax rates 
of convergence on the Sobolev classes are attained by linear estimator (3). 
It seems interesting to compare the performance of estimator (3) and its 
adaptive version in Section 2.3 with that of the minimax linear estimator. 
Consider the estimation problem as follows; cf. [19], Problem 2.2: 

Problem D. We observe $n$ independent realizations $\eta_1,\dots,\eta_n$ of a
random variable $\eta$, taking values in $S = \{1,\dots,m\}$. The distribution
of $\eta$ is identified with a vector $p$ from the $m$-dimensional simplex
$\mathcal{P}_m = \{y \in \mathbb{R}^m : y \ge 0, \sum_{i=1}^m y_i = 1\}$ by
setting $p_k = P\{\eta = k\}$, $1 \le k \le m$. Suppose that vector $p$ is
affinely parameterized by an $M$-dimensional "signal"-vector of unknown
"parameters" $x \in \mathcal{X} \subset \mathcal{P}_M$:
$p = Ax = [[Ax]_1; \dots; [Ax]_m]$. Here $x \mapsto Ax$ is the linear mapping
with $A\mathcal{X} \subset \mathcal{P}_m$, and $[a]_j$ stands for the $j$th
element of $a$. Our goal is to estimate a given linear form $g(x) = g^T x$ at
the point $x$ underlying the observation $\eta^n$.

It is obvious that if distributions of $X$ and $\zeta$ are compactly
supported, or can be "compactified" (i.e., for any $\epsilon > 0$ one can
point out bounded intervals of probability $1 - \epsilon$ for $X$ and
$\zeta$), then under very minor regularity conditions on $f_\zeta$ and $F$,
Problem D approximates the initial distribution deconvolution problem with
"arbitrary accuracy." The latter means that given $\epsilon > 0$ we can
compile the discretized problem such that its $\delta$-solution is the
solution to the initial continuous problem with the accuracy
$\delta + \epsilon$ with probability $1 - \epsilon$.

We consider the following discretization of the deconvolution problem: 

(1) Let $J = [a_0, a_m]$ be the (finite) observation domain, and let
$a_0 < a_1 < a_2 < \dots < a_{m-1} < a_m$. We split $J$ into $m$ intervals
$J_1 = [a_0, a_1]$, $J_2 = (a_1, a_2]$, ..., $J_m = (a_{m-1}, a_m]$. We denote
$p_k = P\{Y \in J_k\}$, $k = 1,\dots,m$.

(2) Suppose that the (finite) interval $I = [b_0, b_M]$ contains the support
of all $F \in \mathcal{F}$. Let $b_0 < b_1 < b_2 < \dots < b_M$; we partition
$I$ into $M$ intervals $I_1 = [b_0, b_1]$, $I_2 = (b_1, b_2]$, ...,
$I_M = (b_{M-1}, b_M]$. We denote $x_k = P\{X \in I_k\}$, $k = 1,\dots,M$.

(3) Denote $\bar b_k = (b_{k-1} + b_k)/2$. Define the $m \times M$ matrix
$A = (A_{jk})$ with elements

$A_{jk} = P\{\bar b_k + \zeta \in J_j\} = \begin{cases} P\{a_0 - \bar b_k \le \zeta \le a_1 - \bar b_k\}, & k = 1,\dots,M,\ j = 1, \\ P\{a_{j-1} - \bar b_k < \zeta \le a_j - \bar b_k\}, & k = 1,\dots,M,\ j = 2,\dots,m, \end{cases}$

and the vector $g = g(t_0) \in \mathbb{R}^M$, with
$g_k = 1(\bar b_k \le t_0)$, $k = 1,\dots,M$. The elements $A_{jk}$ of $A$ are
the approximations of conditional probabilities
$P\{Y \in J_j \mid X \in I_k\}$, and $g^T x$ is an approximation of $F(t_0)$.

(4) Consider discrete observations $\eta_i \in \{1,\dots,m\}$ as follows:

$\eta_i = 1(a_0 \le Y_i \le a_1) + \sum_{j=2}^{m} j\,1(a_{j-1} < Y_i \le a_j)$, $i = 1,\dots,n$.

If the sets $I$ and $J$ are selected so that $P\{X \in I\} \ge 1 - \epsilon$,
$P\{Y \in J\} \ge 1 - \epsilon$ for any $F \in \mathcal{F}$, if $\mathcal{F}$
is the class of "regular distributions" and the noise distribution possesses
some regularity, and if the partitions of $I$ and $J$ are "fine enough," then
solving Problem D with $\mathcal{X}$ being the corresponding $M$-dimensional
cross-section of $\mathcal{F}$ will provide us with an estimate of $F(t_0)$
in the continuous deconvolution problem.
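
Steps (1)-(4) of the discretization translate directly into code. The sketch below is illustrative only: it assumes a centered Laplace noise law (any distribution function of $\zeta$ could be passed instead), takes the two grids as plain lists, and uses hypothetical function names.

```python
import bisect
import math

def laplace_cdf(t, scale):
    # Distribution function of a centered Laplace law (an assumed noise model).
    if t < 0.0:
        return 0.5 * math.exp(t / scale)
    return 1.0 - 0.5 * math.exp(-t / scale)

def build_discrete_model(a, b, t0, noise_cdf):
    """a: observation grid a_0 < ... < a_m; b: signal grid b_0 < ... < b_M.
    Returns the m x M matrix A and the vector g of step (3)."""
    mids = [(b[k] + b[k + 1]) / 2.0 for k in range(len(b) - 1)]
    # A[j][k] = P{a_j - mid_k < zeta <= a_{j+1} - mid_k} (0-based rows).
    A = [[noise_cdf(a[j + 1] - mid) - noise_cdf(a[j] - mid) for mid in mids]
         for j in range(len(a) - 1)]
    g = [1.0 if mid <= t0 else 0.0 for mid in mids]
    return A, g

def bin_index(y, a):
    """Step (4): index j in {1, ..., m} of the interval J_j containing y,
    or None when y falls outside [a_0, a_m]."""
    if y < a[0] or y > a[-1]:
        return None
    return max(1, bisect.bisect_left(a, y))
```

When the observation domain is wide relative to the noise scale, each column of the matrix sums to almost 1, which reflects the requirement $P\{Y \in J\} \ge 1 - \epsilon$.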

We now concentrate on solving the deconvolution problem in the discrete 
model. 

3.1. Minimax estimation in the discrete model. An estimate of $g(x)$, a
candidate solution to our problem, is a measurable function
$\hat g = \hat g(\eta^n) : S^n \to \mathbb{R}$. Given tolerance
$\epsilon \in (0, 1)$, we define the $\epsilon$-risk of such an estimate on
$\mathcal{X}$ as

$\mathrm{Risk}_\epsilon(\hat g; \mathcal{X}) = \inf\{\delta : \sup_{x \in \mathcal{X}} P_x\{|\hat g(\eta^n) - g^T x| \ge \delta\} \le \epsilon\}$,

where $P_x$ stands for the distribution of observations associated with the
"signal" $x$. The minimax optimal $\epsilon$-risk is

$\mathrm{Risk}^*_\epsilon(\mathcal{X}) = \inf_{\hat g(\cdot)} \mathrm{Risk}_\epsilon(\hat g; \mathcal{X})$.

We are particularly interested in the family of estimators of the following
structure:

$\hat g_{\phi,c}(\eta^n) = \frac{1}{n} \sum_{i=1}^{n} \phi(\eta_i) + c = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} \phi_k 1(\eta_i = k) + c$.

We refer to such estimators $\hat g_{\phi,c}$ as affine. In other words,
$\hat g_{\phi,c}$ is an affine function of the empirical distribution: for some
$\phi \in \mathbb{R}^m$ and $c \in \mathbb{R}$,

$\hat g_{\phi,c}(\eta^n) = \sum_{k=1}^{m} \phi_k \hat p_n(k) + c$,

where $\hat p_n$ is the empirical distribution of the observation sample,
$\hat p_n(k) = \frac{1}{n} \sum_{i=1}^{n} 1(\eta_i = k)$. An important property
of the class of affine estimators, when applied to Problem D with convex set
$\mathcal{X}$, is that one can choose an estimator from the class such that
its $\epsilon$-risk attains (up to a moderate constant $\approx 2$; see
Theorem 3.1 below) the minimax $\epsilon$-risk
$\mathrm{Risk}^*_\epsilon(\mathcal{X})$.
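
The two equivalent forms of an affine estimator (an average of $\phi(\eta_i)$ plus $c$, or a linear form of the empirical distribution plus $c$) can be illustrated as follows. This is a minimal sketch with hypothetical names; the weights $\phi$ and the shift $c$ are taken as given.

```python
def empirical_distribution(observations, m):
    # p_hat(k) = fraction of observations equal to k, for k = 1, ..., m.
    counts = [0] * m
    for eta in observations:
        counts[eta - 1] += 1
    n = len(observations)
    return [count / n for count in counts]

def affine_estimate(observations, phi, c):
    # sum_k phi_k * p_hat(k) + c, with phi given as a list of length m.
    p_hat = empirical_distribution(observations, len(phi))
    return sum(weight * p for weight, p in zip(phi, p_hat)) + c
```

For the sample (1, 2, 2, 3) with weights (0, 1, 2) and $c = 0.5$, both forms give $(0 + 1 + 1 + 2)/4 + 0.5 = 1.5$.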

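From now on let us assume that $\mathcal{X} \subset \mathcal{P}_M$ is a convex
closed (and, being a subset of an $M$-dimensional simplex, compact) set.

Let us consider the affine estimator of $g^T x$

$\hat g_\epsilon(\eta^n) = \hat g_{\phi,c}(\eta^n) = \sum_{k=1}^{m} \phi_k \hat p_n(k) + c$,

in which the parameters $\phi$ and $c$ of $\hat g_\epsilon$ are defined as
follows. Consider the optimization problem

(10) $S(\epsilon) = \max_{x,y \in \mathcal{X}} \Big\{ \frac{1}{2} g^T(y - x) : h(x,y;\epsilon) = n \ln\Big[\sum_{k=1}^{m} \sqrt{[Ax]_k [Ay]_k}\Big] + \ln(2/\epsilon) \ge 0 \Big\}$.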

Let $(\bar x, \bar y)$ be an optimal solution to (10), and let $\nu \ge 0$ be
the Lagrange multiplier of the constraint $h(x,y;\epsilon) \ge 0$. We set

$c = \frac{1}{2} g^T[\bar y + \bar x]$ and $\phi_j = \nu n \ln\sqrt{\frac{[A\bar y]_j}{[A\bar x]_j}}$, $j = 1,\dots,m$.

We have the following result.

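Theorem 3.1. Let $\epsilon \in (0, 1/4]$. Then the $\epsilon$-risk of the
estimator $\hat g_\epsilon$ satisfies

(11) $\mathrm{Risk}_\epsilon(\hat g_\epsilon; \mathcal{X}) \le S(\epsilon) \le \vartheta(\epsilon)\,\mathrm{Risk}^*_\epsilon(\mathcal{X})$, $\vartheta(\epsilon) = \frac{2\ln(2/\epsilon)}{\ln[1/(4\epsilon)]}$.

Note that $\vartheta(\epsilon) \to 2$ as $\epsilon \to 0$; thus for small
tolerance levels the $\epsilon$-risk of the estimator $\hat g_\epsilon$ is
within factor $\approx 2$ of the minimax $\epsilon$-risk. It is important to
emphasize that $\hat g_\epsilon$ is readily given by a solution to the
explicit convex program (10), and as such, it can be found in a
computationally efficient fashion, provided that $\mathcal{X}$ is
computationally tractable.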
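
The factor in (11) approaches 2 rather slowly, which a short computation makes visible. The sketch assumes the expression $\vartheta(\epsilon) = 2\ln(2/\epsilon)/\ln[1/(4\epsilon)]$ as stated in Theorem 3.1 and restricts to $\epsilon < 1/4$, where the denominator is positive; the function name is hypothetical.

```python
import math

def risk_ratio_bound(eps):
    # vartheta(eps) = 2 ln(2/eps) / ln(1/(4 eps)), finite for 0 < eps < 1/4.
    if not 0.0 < eps < 0.25:
        raise ValueError("eps must lie in (0, 1/4)")
    return 2.0 * math.log(2.0 / eps) / math.log(1.0 / (4.0 * eps))
```

For example, the factor is about 3.29 at $\epsilon = 0.01$ and about 2.33 at $\epsilon = 10^{-6}$.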

In the "historical perspective" the affine estimator $\hat g_\epsilon$
represents an alternative to the binary search estimator $\hat g_B$, proposed
in [10] for the case of "direct" observations. It can be shown that the
$\epsilon$-risk $\mathrm{Risk}_\epsilon(\hat g_B; \mathcal{X})$ of that
estimator satisfies
$\mathrm{Risk}_\epsilon(\hat g_B; \mathcal{X}) \le C\,\mathrm{Risk}^*_\epsilon(\mathcal{X})$
for small $\epsilon$ (e.g., one can prove that $C < 26$ whenever
$\epsilon < 0.01$). To the best of our knowledge, risk bound (11) in
Theorem 3.1 for the estimator $\hat g_\epsilon$ is much better than those
available for the binary search estimator.

Note that the constraint $h(x,y;\epsilon) \ge 0$ of the problem (10) can be
rewritten as follows:

$\rho(x,y) \ge (\epsilon/2)^{1/n}$,

where

$\rho(x,y) = \sum_{k=1}^{m} \sqrt{[Ax]_k [Ay]_k}$

is the Hellinger affinity of distributions $Ax$ and $Ay$; cf. [21] and [22],
Chapter 4. Thus the optimal value $S(\epsilon)$ of the optimization problem
(10) can be seen as modulus of continuity of the linear functional $g(\cdot)$
over the class $\mathcal{X}$ of distributions "with respect to Hellinger
affinity." If $\frac{1}{n}\ln[1/\epsilon] = o(1)$ we have
$\rho(x,y) \approx 1$ and

$H^2(x,y) = 1 - \rho(x,y) \approx -\ln\rho(x,y)$,

where H{x,y) is the Hellinger distance between x and y. In this limit we 
have 

- 1 / [WM\ _ / 1 T, ^ W I^SAX 

i>(e « -uj\ \ -— = max { -q ly — x), H {x,y) < \ >. 

2 y\ 2N J x,y&xy2^ ^' ^ '^^-y 2n J 

Here $\omega(\cdot)$ is the "modulus of continuity of $g$ over $\mathcal{X}$
with respect to Hellinger distance," introduced in [10]. Therefore, bound (11)
can be seen as a finite-dimensional nonasymptotic counterpart of [10],
Theorem 3.1.
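
The equivalence between the constraint of (10) and the bound on the Hellinger affinity, as well as the approximation $1 - \rho \approx -\ln\rho$ for close distributions, can be checked on small examples. The sketch below takes the constraint in the form $n\ln\rho + \ln(2/\epsilon) \ge 0$ as read above and works directly with two probability vectors (standing for $Ax$ and $Ay$); the names are hypothetical.

```python
import math

def hellinger_affinity(p, q):
    # rho = sum_k sqrt(p_k q_k); equals 1 exactly when p and q coincide.
    return sum(math.sqrt(pk * qk) for pk, qk in zip(p, q))

def constraint_value(p, q, n, eps):
    # h = n ln(rho) + ln(2 / eps); the pair (p, q) is feasible when h >= 0.
    return n * math.log(hellinger_affinity(p, q)) + math.log(2.0 / eps)

def is_feasible(p, q, n, eps):
    # Equivalent formulation: rho >= (eps / 2)^(1 / n).
    return hellinger_affinity(p, q) >= (eps / 2.0) ** (1.0 / n)
```

With more observations the feasible set shrinks: a pair of distributions that is feasible for $n = 10$ may become infeasible for $n = 100$, which is how the sample size enters the modulus of continuity $S(\epsilon)$.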

3.2. Adaptive version of the estimate. Consider a modification of our
estimation problem where the set $\mathcal{X}$, instead of being given in
advance, is known to be one of the sets from the collection of nonempty convex
compact sets $\mathcal{X}^1, \mathcal{X}^2, \dots, \mathcal{X}^N$ in
$\mathbb{R}^M$. We aim to construct an adaptive estimator of the linear form
$g^T x$, given that $x$ is an element of some $\mathcal{X}^k$ in the
collection. Here we consider the simple case where the sets are nested:
$\mathcal{X}^1 \subset \mathcal{X}^2 \subset \dots \subset \mathcal{X}^N$. Note
that in the case of nonnested sets an adaptive estimator can be constructed
following the ideas of [7].

Given a linear form $g^T z$ on $\mathbb{R}^M$, let $\mathrm{Risk}^k(\hat g)$
and $\mathrm{Risk}^k_*$ be, respectively, the $\epsilon$-risk of an estimate
$\hat g$ on $\mathcal{X}^k$, and the minimax optimal $\epsilon$-risk of
recovering $g^T x$ on $\mathcal{X}^k$. Let also $S_k(\cdot)$ be the function
$S(\cdot)$ in (10) associated with $\mathcal{X} = \mathcal{X}^k$. As is
immediately seen, the functions $S_k(\cdot)$ grow with $k$. Our goal is to
modify the estimate $\hat g$ we have built in such a way that the
$\epsilon$-risk of the modified estimate on $\mathcal{X}^k$ will be "nearly"
$\mathrm{Risk}^k_*$ for every $k \le N$. This goal can be achieved by a
straightforward application of Lepski's adaptation scheme as follows.

Given $\epsilon > 0$, let $\hat g^k(\cdot)$ be the affine estimate with the
$(\epsilon/N)$-risk on $\mathcal{X}^k$ not exceeding $S_k(\epsilon/N)$, as
provided by Theorem 3.1 applied with $\epsilon/N$ substituted for $\epsilon$
and $\mathcal{X}^k$ substituted for $\mathcal{X}$. Then

$\sup_{x \in \mathcal{X}^k} P_x\{|\hat g^k(\eta^n) - g^T x| > S_k(\epsilon/N)\} \le \epsilon/N$ $\forall k \le N$.

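Given observation $\eta^n$, let us say that the index $k \le N$ is
$\eta^n$-good, if for all $k'$ satisfying $k \le k' \le N$ one has

$|\hat g^{k'}(\eta^n) - \hat g^{k}(\eta^n)| \le S_k(\epsilon/N) + S_{k'}(\epsilon/N)$.

Note that $\eta^n$-good indices do exist (e.g., $k = N$). Given $\eta^n$, we
can find the smallest $\eta^n$-good index $k = k(\eta^n)$; our estimate is
nothing but $\hat g(\eta^n) = \hat g^{k(\eta^n)}(\eta^n)$.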

Proposition 3.1. Assume that eG (0,1/4), and let 

ln(2iVA) 
ln(2A) • 

Then 

sup P..{|5«) - > ^Sk{e)} < e y{k, l<k<N); 
whence also 

msk\g) < ^^^f^^ Risk^ V(fe, l<k<N). 
ln[l/(4e)J 

The proof of the proposition follows exactly the same steps as that of
Proposition 5.1 of [19], and it is omitted.

4. Numerical examples. To illustrate our results we present here exam- 
ples of implementation of the adaptive estimation procedures of Sections 2.3 
and 3.2. 

We consider three measurement error distributions scenarios: 

(i) Gamma distribution $\Gamma(0, 2, 1/(2\sqrt2))$ with the shape parameter 2
and the scale $1/(2\sqrt2)$ (the standard deviation of the error is equal to 0.5).
Here $\Gamma(\mu, \alpha, \theta)$ stands for the Gamma distribution with location $\mu$, shape
parameter $\alpha$ and scale $\theta$, such that its density is
$[\Gamma(\alpha)\theta^\alpha]^{-1}(x - \mu)^{\alpha-1}\exp\{-(x - \mu)/\theta\}\mathbf{1}(x \ge \mu)$.

(ii) Mixture of Laplace distributions $\frac12\mathcal{L}(-1, \frac12) + \frac12\mathcal{L}(1, \frac12)$; here $\mathcal{L}(\mu, a)$
stands for the Laplace distribution with the density $(2a)^{-1}e^{-|x-\mu|/a}$.

(iii) Normal mixture $\frac12\mathcal{N}(0, \frac12) + \frac12\mathcal{N}(2, 1)$.

We consider three distributions of $X$:

(1) mixture of "shifted" Gamma distributions: $0.3\Gamma(0, 0.5, 2) + 0.7\Gamma(5, 0.5, 2)$;

(2) mixture of Laplace distributions $0.3\mathcal{L}(-1.5, 0.5) + 0.7\mathcal{L}(1.7, 0.25)$;

(3) normal mixture $0.6\mathcal{N}(0.15827, 1) + 0.4\mathcal{N}(1, 0.0150)$.

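Note that in the case (i) of the $\Gamma(0, 2, \theta)$ error distribution the estimator (3)
can be computed explicitly: we have
$\hat F_\lambda = \frac12 - \frac{1}{\pi n}\sum_{j=1}^n I_\lambda(Y_j - t_0)$, where

    $I_\lambda(y) = \mathrm{Si}(\lambda y) + y^{-1}[\theta^2\lambda\cos(\lambda y) - 2\theta\sin(\lambda y)] - y^{-2}\theta^2\sin(\lambda y),$

and $\mathrm{Si}(z) = \int_0^z \omega^{-1}\sin\omega\,d\omega$ is the sine integral function. Then the adaptive
estimation algorithm of Section 2.3 is implemented for the grid
$\Lambda = \{\lambda \in [0.01 : 0.05 : 10]\}$.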
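
The closed form above can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the authors' code: it takes estimator (3) to be of the Fourier-inversion form $\frac12 - \frac1\pi\int_0^\lambda \omega^{-1}\,\mathrm{Im}\{e^{-i\omega t_0}\hat\phi_Y(\omega)/\phi_\varepsilon(\omega)\}\,d\omega$ with $\phi_\varepsilon(\omega) = (1 - i\theta\omega)^{-2}$ for the Gamma error with shape 2 and scale $\theta$, and it compares the closed-form kernel with direct quadrature. All function names are hypothetical, the sine integral is computed by a plain Simpson rule to stay within the standard library, and the test sample ($X$ standard normal) is a toy choice rather than one of the scenarios of this section.

```python
import cmath
import math
import random


def simpson(f, a, b, panels=2000):
    """Composite Simpson rule for the integral of f over [a, b] (panels must be even)."""
    h = (b - a) / panels
    total = f(a) + f(b)
    for i in range(1, panels):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


def sine_integral(z, panels=2000):
    """Si(z) = int_0^z sin(u)/u du, evaluated numerically."""
    return simpson(lambda u: 1.0 if u == 0.0 else math.sin(u) / u, 0.0, z, panels)


def kernel_closed_form(y, lam, theta, panels=2000):
    """Closed-form kernel I_lambda(y) for Gamma(shape 2, scale theta) errors."""
    if abs(y) < 1e-8:
        return -2.0 * theta * lam  # limit of the expression as y -> 0
    s, c = math.sin(lam * y), math.cos(lam * y)
    return (sine_integral(lam * y, panels)
            + (theta ** 2 * lam * c - 2.0 * theta * s) / y
            - theta ** 2 * s / y ** 2)


def kernel_quadrature(y, lam, theta, panels=2000):
    """Same kernel by integrating Im{e^{i w y} (1 - i*theta*w)^2} / w over [0, lam]."""
    def integrand(w):
        if w == 0.0:
            return y - 2.0 * theta  # limit of the integrand as w -> 0
        return (cmath.exp(1j * w * y) * (1.0 - 1j * theta * w) ** 2).imag / w
    return simpson(integrand, 0.0, lam, panels)


def cdf_estimate(sample, t0, lam, theta, panels=200):
    """F_hat_lambda(t0) = 1/2 - (1/(pi*n)) * sum_j I_lambda(Y_j - t0)."""
    total = sum(kernel_closed_form(y - t0, lam, theta, panels) for y in sample)
    return 0.5 - total / (math.pi * len(sample))


theta = 1.0 / (2.0 * math.sqrt(2.0))  # scale of the Gamma error in scenario (i)
max_gap = max(abs(kernel_closed_form(y, 5.0, theta) - kernel_quadrature(y, 5.0, theta))
              for y in (-2.0, -0.3, 0.0, 0.7, 3.0))

# Toy data (an assumption, not one of the paper's test cases): X ~ N(0, 1).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) + rng.gammavariate(2.0, theta) for _ in range(1000)]
f_hat = cdf_estimate(sample, t0=0.0, lam=3.0, theta=theta)
```

Here `max_gap` measures the disagreement between the closed form and the quadrature, and `f_hat` estimates $F(0) = 1/2$ for the toy sample with a fixed, non-adaptive $\lambda$.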

Estimation procedures, described in Section 3.1, were implemented using
Mosek optimization software [1]. The observation space and the signal space
were split into $m = M = 200$ bins. The adaptation procedure was implemented
over 17 linear estimators corresponding to the classes $\mathcal{X}^1, \ldots, \mathcal{X}^{17}$
of "Lipschitz-continuous" discrete distributions with Lipschitz constants on
the geometric grid, scaled from 0.001 to 1 [if reduced to continuous densities,
it corresponds to the approximate range of Lipschitz constant from $O(0.1)$
to $O(100)$].
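
A geometric grid of this kind is easy to generate. The sketch below is illustrative: `geometric_grid` is a hypothetical helper, and the bin width `h = 0.1` together with the rescaling by $h^{-2}$ is only a heuristic assumption (bin probabilities $p_i \approx f(x_i)h$, so differences of neighbouring bins behave like $f'h^2$) chosen because it reproduces the orders of magnitude quoted above; the text does not state the bin width.

```python
from typing import List


def geometric_grid(lo: float, hi: float, count: int) -> List[float]:
    """Return `count` points from lo to hi with a constant ratio between neighbours."""
    if count < 2 or lo <= 0.0 or hi <= lo:
        raise ValueError("need count >= 2 and 0 < lo < hi")
    ratio = (hi / lo) ** (1.0 / (count - 1))
    return [lo * ratio ** i for i in range(count)]


# One discrete Lipschitz constant per class, from 0.001 to 1.
lipschitz_grid = geometric_grid(0.001, 1.0, 17)

# Hypothetical bin width h: a bound |p_{i+1} - p_i| <= c on bin probabilities
# corresponds roughly to |f'| <= c / h**2 for the underlying density f.
h = 0.1
density_scale = [c / h ** 2 for c in lipschitz_grid]
```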

The simulation has been repeated for 100 observation samples of size $n =
2{,}000$. In Figure 2 we present simulation results for scenario (i), when the
error distribution follows the $\Gamma(0, 2, 1/(2\sqrt2))$ law. The left column displays
"typical" results of estimation corresponding to three signal distributions.
We present the true distribution (solid line), the estimate $\hat F_{\hat\lambda}$ of Section
2.3 (dotted line), the estimate $\hat g(\eta^n)$ of Section 3.2 (dashed line) and the
empirical distribution of the observations (dash-dot line). The boxplots in
the right column summarize the corresponding empirical distributions of the
maximal estimation error over 50 points of the regular grid on the support
of $f$ for two estimators: (a) for $\hat g(\eta^n)$ of Section 3.2 and (b) for $\hat F_{\hat\lambda}$ of
Section 2.3. In Figure 3 we present "typical" results for the adaptive estimator
$\hat g(\eta^n)$ of Section 3.2 under the error scenarios (ii) (on the left) and (iii)
(on the right). Similarly to Figure 2 we plot the true cdf (solid line), the
adaptive estimator $\hat g(\eta^n)$ of Section 3.2 (dashed line) and the observation edf
(dash-dot line). The results of this simulation are summarized in Figure 4. The
first boxplot (the left column plots) represents the distribution of the maximal
estimation error over 50 points of the regular grid on the support of $f$. Next,
for each point in the grid we compute the maximal estimation error over 100
simulations; the distribution of maximal errors "over the points of the grid"
is represented in the second boxplot (plots in the right column).

Fig. 2. Simulation results for the Gamma error scenario. On the left: true cdf (solid
line), adaptive estimator $\hat g(\eta^n)$ of Section 3.2 (dashed line), adaptive estimator $\hat F_{\hat\lambda}$ of
Section 2.3 (dotted line) and the edf of the observations (dash-dot line). On the right: the
boxplots of the maximal estimation error of $\hat g(\eta^n)$ (a) and $\hat F_{\hat\lambda}$ (b).
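
The two error summaries behind the boxplots (the maximum over the grid for each simulation, and the maximum over the simulations for each grid point) can be computed as follows. This is a minimal sketch with made-up numbers; the function name `error_summaries` and the layout of the input table are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def error_summaries(
    abs_errors: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """abs_errors[s][g] = |estimate - truth| in simulation s at grid point g.

    Returns (maximum over the grid for each simulation,
             maximum over the simulations for each grid point).
    """
    if not abs_errors or not abs_errors[0]:
        raise ValueError("need at least one simulation and one grid point")
    per_simulation = [max(row) for row in abs_errors]
    per_grid_point = [max(column) for column in zip(*abs_errors)]
    return per_simulation, per_grid_point


# Toy input: 3 simulations and 4 grid points (the text uses 100 and 50).
toy = [
    [0.01, 0.04, 0.02, 0.03],
    [0.05, 0.01, 0.02, 0.02],
    [0.02, 0.02, 0.06, 0.01],
]
per_simulation, per_grid_point = error_summaries(toy)
```

Each of the two returned lists is what a single boxplot would summarize.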

Remarks. The numerical examples in this section illustrate strong and 
weak points of the proposed estimators related to practical implementation. 
They can be summarized as follows. 

The adaptive estimator of Section 2.3 is based on the choice of the unique
smoothing parameter $\lambda$. This imposes a "natural" family of nested classes
and facilitates implementation of the adaptation scheme. Yet, this estimator
should be "explicitly tuned" for a specific distribution of the errors. In
particular, the integral computation in (3) for a given distribution of $\varepsilon$ may
become very tedious. Even though our theoretical results are proved under
the condition that the characteristic function of $\varepsilon$ is nonzero for all
$\omega \in \mathbb{R}$, in practical implementation the estimator (3) could be modified in
order to allow characteristic functions vanishing at a finite number of points
on $\mathbb{R}$. In this case the integration domain in (3) should exclude some
properly specified vicinities of the points where the characteristic function
vanishes.

Fig. 3. Simulation results: true cdf (solid line), adaptive estimator (dashed line) and
empirical distribution function of the observation (dash-dot line). On the left, (ii) are the
results for mixed Laplace noise; on the right, (iii) are the results for the mixed normal
noise.

Fig. 4. Estimation error distribution. Left column: empirical distribution of the maximal
error of estimation over a regular grid; right column: distribution of the maximal over 100
simulations estimation error over the points of the grid. On each plot the left boxplot (a)
corresponds to the mixed Laplace noise, while the right boxplot (b) corresponds to the mixed
normal noise.

In contrast to this, the adaptive estimator in Section 3.2 can be easily
tuned to any noise distribution and convex target distribution class. For
instance, the characteristic function of noise in the Laplace scenario (ii)
vanishes at some points, which precludes the possibility of utilizing the
estimator of Section 2.3 without proper modifications. Note that one can easily
incorporate any additional available information on the unknown distribution
that can be expressed as a convex constraint in the corresponding optimization
problem. The typical examples of such constraints are unimodality,
symmetry, monotonicity and moment bounds. However, this freedom comes
at a price: the family $\mathcal{X}^1 \subset \cdots \subset \mathcal{X}^N$ of the embedded classes for the adaptive
estimator in Section 3.2 should be constructed "by hand." The computation
of the adaptive affine estimator of Section 3.2 is also a heavy numerical task.
In particular, in our setting it involves solving 17 conic quadratic optimization
problems with 1,006 variables, 809 linear and 202 conic constraints.

It is well known that the normal noise in the deconvolution problem results
in a very poor quality of estimation [13]. In particular, the minimax rate
of convergence in this case is $O((\ln n)^{-\gamma})$ with $\gamma > 0$ depending on the
exponent $\alpha$ of the regularity class $\mathcal{F}_\alpha(L)$. Fortunately, these pessimistic results
concern the asymptotic behavior of the estimators as $n \to \infty$. We
observed that the estimation procedures exhibit much better performance
for small or moderately sized observation samples. On the other hand, this
performance improves only slowly when the sample size grows: in our
experiments, for instance, the estimation accuracy, measured by the $\ell_\infty$-error over
a regular grid in the distribution domain, improved only by the factor $\approx 2$
when we increased the sample size from $n = 2{,}000$ to $n = 100{,}000$.
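
To put the reported factor in perspective, one can ask which power-law exponent would produce it. The short Python computation below is only a back-of-the-envelope reading of the two sample sizes quoted above: the power-law form error $\propto n^{-r}$ is an assumption made for this illustration (the asymptotic rate for normal noise is logarithmic, not polynomial), and `implied_rate_exponent` is a hypothetical helper.

```python
import math


def implied_rate_exponent(n_small: int, n_large: int, improvement: float) -> float:
    """Exponent r such that error ~ n**(-r) reproduces the observed improvement."""
    return math.log(improvement) / math.log(n_large / n_small)


# Improvement by a factor of about 2 when n grows from 2,000 to 100,000.
r = implied_rate_exponent(2_000, 100_000, 2.0)

# For comparison: the gain that a parametric n**(-1/2) rate would give.
parametric_gain = math.sqrt(100_000 / 2_000)
```

Under this reading the exponent is roughly 0.18, while a parametric rate would have reduced the error by a factor of about 7.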

5. Proofs. This section is organized as follows. In Section 5.1 we state
the main results that are used in the proofs of Theorems 2.1 and 2.3 and briefly
discuss the proof outline. Then in Section 5.2 we prove Theorem 2.4. Full
proofs of all auxiliary results and additional technical details are given in
the supplementary paper [8].

5.1. Proofs of Theorems 2.1 and 2.3. Proofs of Theorems 2.1 and 2.3
proceed along the same lines and exploit three basic statements presented here.
Lemmas 5.1 and 5.2 given below establish upper bounds on the bias and
variance of the estimator $\hat F_\lambda$. Then we present Lemma 5.3 that states an
exponential inequality on the stochastic error of $\hat F_\lambda$. This result is used to
derive bounds on the $\epsilon$-risk. Finally we briefly explain how the stated
results are combined in order to complete the proof of Theorems 2.1 and 2.3.

We start with the standard decomposition of the error of estimator (3). 



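    $|\hat F_\lambda - F(t_0)| \le |\mathbf{E}\hat F_\lambda - F(t_0)| + |\hat F_\lambda - \mathbf{E}\hat F_\lambda| =: B_\lambda(t_0; F) + |V_\lambda|,$

    $\mathbf{E}|\hat F_\lambda - F(t_0)|^2 = B_\lambda^2(t_0; F) + \mathbf{E}|V_\lambda|^2,$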



where we have denoted 



Bx{to,F):= - 
and



^.(A):=-/ -^i , Ma;, j = l,...,n. 

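5.1.1. Bounds on bias and variance. First we bound the bias of $\hat F_\lambda$.

Lemma 5.1. Let $\hat F_\lambda$ be the estimator defined in (3); then for every class
$\mathcal{F}_\alpha(L)$ with $\alpha > -\frac12$, $L > 0$ and for any $\lambda \ge 1$ one has

(12)  $\sup_{F \in \mathcal{F}_\alpha(L)} B_\lambda(t_0; F) \le K_0 L \lambda^{-\alpha-1/2}, \qquad K_0 := \sqrt{2/\pi}\,[1 + (2\alpha + 1)^{-1/2}].$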

Now we establish an upper bound on the variance of $\hat F_\lambda$. Recall that $\omega_1$
and $c_*$ are given in (4) and depend on the constants $\omega_0$, $b_\varepsilon$ and $\tau$ appearing
in assumption (E2). Define



(13)  $w(\lambda) := \begin{cases} \lambda^{2\beta-1}, & \beta > 1/2, \\ \ln(\lambda/\omega_1), & \beta = 1/2, \\ 1, & \beta \in (0, 1/2). \end{cases}$



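Lemma 5.2. Let assumptions (E1), (E2) hold and $\hat F_\lambda$ be the estimator
defined in (3). Then there exist constants $K_1 = K_1(\alpha, \beta, \omega_1)$ and $K_2 =
K_2(\beta, \omega_1)$ such that for every $\lambda \ge 1 \vee \omega_1$ the following statements hold:

(i) If $\alpha + \beta > 1/2$, then

    $\mathrm{var}\{\hat F_\lambda\} \le K_1 L C_\varepsilon c_\varepsilon^{-2} w(\lambda) n^{-1} + c_* n^{-1}.$

If $\beta > 1$, then the upper bound can be made independent of $\alpha$ and $L$:

    $\mathrm{var}\{\hat F_\lambda\} \le K_2 C_\varepsilon c_\varepsilon^{-2} \lambda^{2\beta-1} n^{-1} + c_* n^{-1}.$

(ii) If $\alpha + \beta = 1/2$, then

    $\mathrm{var}\{\hat F_\lambda\} \le K_1 L C_\varepsilon c_\varepsilon^{-2} w(\lambda) \sqrt{\ln(\lambda/\omega_1)}\, n^{-1} + c_* n^{-1}.$

(iii) If $\alpha + \beta < 1/2$, then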

(14) var{FA} < Kic^^ mmlLC^X^^^-f^-" ,ln^{X/u}i) + A^^]?!"^ + c^n-\ 
Explicit expressions for $K_1$ and $K_2$ are given in the proof; see (A.12) in [8].

It is worth noting that if $\beta > 1$, then the upper bound on the variance
of $\hat F_\lambda$ stated in part (i) does not depend on parameters $\alpha$ and $L$. This is
particularly important when the problem of adaptive estimation of $F(t_0)$ is
considered.

5.1.2. An exponential inequality. First we recall some notation.




where $\omega_1 = \min\{\omega_0, (2b_\varepsilon)^{-1/\tau}\}$, $c_* = 2\pi^{-2}[2 + (1/\tau)]^2$ and constants $\omega_0$, $b_\varepsilon$
and $\tau$ appear in assumption (E2). Define



It is easily seen that nix ^ jtiA'^, VA > 1, where m is defined in (6). We also 
put 



ct2 := c, + C^c7^KiLl{(3 < 1) + [(KiL) V A-2]l(/3 > 1)}, 



where constants $K_1$ and $K_2$ are given in (A.12) in [8].

Lemma 5.3. Suppose that assumptions (E1) and (E2) hold; then for any
$\lambda > 0$ and $z > 0$ one has



In particular, if $\alpha + \beta > 1/2$, then for any $\lambda \ge 1 \vee \omega_1$ and $z > 0$ one has



where $w(\lambda)$ is given in (13).

5.1.3. Outline of the proofs of Theorems 2.1 and 2.3. The upper bounds
on the quadratic risk stated in Theorems 2.1 and 2.3 are an immediate
consequence of Lemmas 5.1 and 5.2. Balancing the upper bounds on the bias
and variance with respect to the smoothing parameter $\lambda$, we come to the
announced results. Lemma 5.3 along with Lemma 5.1 is used in order to
derive upper bounds on the $\epsilon$-risk. Full technical details are provided in the
supplementary paper [8].
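
As a rough illustration of the balancing step, consider the case $\beta > 1/2$ and $\alpha + \beta > 1/2$. The sketch below suppresses all constants and assumes a squared-bias bound of order $L^2\lambda^{-2\alpha-1}$ (as in Lemma 5.1) and a variance bound whose leading term is of order $L\lambda^{2\beta-1}n^{-1}$ (as in Lemma 5.2(i)); it is a heuristic computation, not a substitute for the proofs in [8].

```latex
\begin{align*}
B_\lambda^2 \lesssim L^2 \lambda^{-2\alpha-1}, \qquad
\operatorname{var}\{\hat F_\lambda\} \lesssim L \lambda^{2\beta-1} n^{-1}
&\;\Longrightarrow\; \lambda_*^{2\alpha+2\beta} \asymp L n, \\
\mathbf{E}\,|\hat F_{\lambda_*} - F(t_0)|^2
&\lesssim L^2 (L n)^{-(2\alpha+1)/(2\alpha+2\beta)} .
\end{align*}
```

Since $(2\alpha+1)/(2\alpha+2\beta) < 1$ when $\beta > 1/2$, a remainder term of order $n^{-1}$ in the variance bound is of smaller order than this quantity.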

5.2. Proof of Theorem 2.4. The next preparatory lemma establishes an
exponential probability inequality on the deviation of $\hat\Sigma_\lambda^2$ from $\Sigma_\lambda^2$.

Lemma 5.4. Suppose that assumptions (E1) and (E2) hold.
(i) For every $\lambda \in \Lambda$



(15) 




(16) 




n\A-A\>vlm<i^r- 
(ii) Let q{e) := s/Sln^iNe-^); then for every A G A
P{\Vx\>qie)vxn~'/^}<J^. 

The proof of Lemma 5.4 is given in [8].

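5.2.1. Proof of Theorem 2.4. Define the following events:

    $A(\lambda) := \{|V_\lambda| \le q(\epsilon) v_\lambda n^{-1/2}\} \cap \{|\hat\Sigma_\lambda^2 - \Sigma_\lambda^2| \le v_\lambda^2/2\},$

    $A(\Lambda) := \bigcap_{\lambda \in \Lambda} A(\lambda).$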

It follows from Lemma 5.4 and $\#(\Lambda) = N$ that $P\{A(\Lambda)\} \ge 1 - \epsilon$. By the
triangle inequality,

(17)  $|\hat F_{\hat\lambda} - F(t_0)| \le |\hat F_{\lambda_0} - F(t_0)| + |\hat F_{\hat\lambda} - \hat F_{\lambda_0}|.$

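By definition of $\lambda_0$ and by the fact that $v_\lambda$ is monotone increasing with
$\lambda$ we have that $v_\lambda n^{-1/2} \ge \bar B_\lambda$ for all $\lambda \ge \lambda_0$, where we have denoted
$\bar B_\lambda := 2(2/\pi)^{1/2} L \lambda^{-\alpha-1/2}$. Therefore, on the event $A(\Lambda)$

(18)  $|\hat F_{\lambda_0} - F(t_0)| \le \bar B_{\lambda_0} + |V_{\lambda_0}| \le [1 + q(\epsilon)] v_{\lambda_0} n^{-1/2}.$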

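Furthermore, if $A(\Lambda)$ holds, then for any pair $\lambda, \mu \in \Lambda$ satisfying $\lambda \ge \lambda_0$
and $\mu \ge \lambda_0$ one has $Q_\lambda \cap Q_\mu \ne \emptyset$. Indeed, by definition of $\lambda_0$ for any
$\lambda \ge \lambda_0$ one has $\bar B_\lambda \le v_\lambda n^{-1/2}$; therefore

    $|\hat F_\lambda - F(t_0)| \le \bar B_\lambda + q(\epsilon) v_\lambda n^{-1/2} \le [1 + q(\epsilon)] v_\lambda n^{-1/2}.$

In addition, on the set $A(\Lambda)$ we have

    $|\hat v_\lambda - v_\lambda| \le |\hat v_\lambda^2 - v_\lambda^2|^{1/2} = |\hat\Sigma_\lambda^2 - \Sigma_\lambda^2|^{1/2} \le v_\lambda/\sqrt2.$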

This yields 

    $|\hat F_\lambda - F(t_0)| \le (1 - 2^{-1/2})^{-1}[1 + q(\epsilon)]\,\hat v_\lambda n^{-1/2}.$

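Thus one has $F(t_0) \in Q_\lambda$ and $F(t_0) \in Q_\mu$ for all $\lambda \ge \lambda_0$ and $\mu \ge \lambda_0$; hence
$Q_\mu \cap Q_\lambda \ne \emptyset$. Then by the procedure definition, $\hat\lambda \le \lambda_0$ and
$Q_{\hat\lambda} \cap Q_{\lambda_0} \ne \emptyset$ on the event $A(\Lambda)$. Therefore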

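      $|\hat F_{\hat\lambda} - \hat F_{\lambda_0}| \le \vartheta n^{-1/2}[\hat v_{\hat\lambda} + \hat v_{\lambda_0}]$

(19)  $\le 2\vartheta n^{-1/2}\hat v_{\lambda_0}$

      $\le \sqrt2(1 + \sqrt2)\,\vartheta n^{-1/2} v_{\lambda_0}.$

Here the second line follows from $\hat v_{\hat\lambda} \le \hat v_{\lambda_0}$, and the fact that
$\hat v_{\lambda_0} \le (1 + 2^{-1/2}) v_{\lambda_0}$ on the event $A(\Lambda)$. Combining (19), (18) and (17) we
obtain that on the set $A(\Lambda)$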



This completes the proof. 
Supplement to "On deconvolution of distribution functions"

(DOI: 10.1214/11-AOS907SUPP; .pdf). In the supplementary paper [8] we
prove results stated here and provide additional details for the proofs
appearing in Section 5. In particular, [8] is partitioned into two appendices, A
and B. Appendix A contains proofs for Section 2: full technical details for
Theorems 2.1, 2.3 and 2.4 are presented, and the proof of Theorem 2.2 is
given. In Appendix B we prove Theorem 3.1 from Section 3.